SPU-Net: Self-Supervised Point Cloud Upsampling by Coarse-to-Fine Reconstruction with Self-Projection Optimization
The task of point cloud upsampling aims to acquire dense and uniform point
sets from sparse and irregular point sets. Although significant progress has
been made with deep learning models, they require ground-truth dense point sets
as supervision, so they can only be trained on synthetic paired data and are
not suitable for training on real-scanned sparse data. Moreover, it is
expensive and tedious to obtain large-scale paired sparse-dense point sets
from real scans. To address this problem,
we propose a self-supervised point cloud upsampling network, named SPU-Net, to
capture the inherent upsampling patterns of points lying on the underlying
object surface. Specifically, we propose a coarse-to-fine reconstruction
framework with two main components: point feature extraction and point
feature expansion. In point feature extraction, we integrate a self-attention
module with a graph convolution network (GCN) to simultaneously capture
context information within and across local regions. In
the point feature expansion, we introduce a hierarchically learnable folding
strategy to generate the upsampled point sets with learnable 2D grids.
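To make these two components concrete, here is a minimal PyTorch sketch (our illustration, not SPU-Net's released code) of a layer that combines an edge-style graph convolution over k-nearest-neighbor regions with self-attention across regions; the k-NN construction, layer names, and sizes are all assumptions:

    import torch
    import torch.nn as nn

    def knn_graph(x, k):
        # x: (B, N, C); indices of the k nearest neighbors of each point
        dist = torch.cdist(x, x)                    # (B, N, N) pairwise distances
        return dist.topk(k, largest=False).indices  # (B, N, k), includes self

    class AttentiveGCNLayer(nn.Module):
        def __init__(self, c_in, c_out, k=16, heads=4):
            super().__init__()
            self.k = k
            # Edge MLP aggregates each point's local neighborhood (graph conv).
            self.edge_mlp = nn.Sequential(
                nn.Linear(2 * c_in, c_out), nn.ReLU(), nn.Linear(c_out, c_out))
            # Self-attention shares context across all local regions.
            self.attn = nn.MultiheadAttention(c_out, heads, batch_first=True)

        def forward(self, x):
            # x: (B, N, C) per-point features
            B, N, C = x.shape
            idx = knn_graph(x, self.k)                          # (B, N, k)
            nbrs = torch.gather(
                x.unsqueeze(1).expand(B, N, N, C), 2,
                idx.unsqueeze(-1).expand(B, N, self.k, C))      # (B, N, k, C)
            center = x.unsqueeze(2).expand_as(nbrs)
            local = self.edge_mlp(
                torch.cat([center, nbrs - center], dim=-1)).max(dim=2).values
            out, _ = self.attn(local, local, local)             # global mixing
            return local + out

The max over each neighborhood gives a per-region feature (the "inside" context), and the attention pass mixes those features across regions (the "among" context).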
Moreover, to further optimize the noisy points in the generated point sets, we
propose a novel self-projection optimization associated with uniform and
reconstruction terms, as a joint loss, to facilitate the self-supervised point
cloud upsampling. We conduct various experiments on both synthetic and
real-scanned datasets, and the results demonstrate that we achieve comparable
performance to the state-of-the-art supervised methods.
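As a rough illustration of how reconstruction and uniform terms could combine into a joint loss, the following sketch pairs a Chamfer-style reconstruction term with a simple repulsion-based uniformity term; the weighting and the exact term forms are our assumptions, not SPU-Net's self-projection objective:

    import torch

    def chamfer(p, q):
        # p: (B, N, 3), q: (B, M, 3); symmetric nearest-neighbor distance
        d = torch.cdist(p, q)
        return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

    def uniformity(p, k=4, eps=1e-8):
        # Repulsion term: penalize tiny k-NN distances so points spread out.
        d = torch.cdist(p, p)
        d = d + 1e6 * torch.eye(p.shape[1], device=p.device)  # mask self-distances
        knn = d.topk(k, largest=False).values                 # (B, N, k)
        return (1.0 / (knn + eps)).mean()

    def joint_loss(upsampled, reference, w_uniform=0.01):
        # reference: the sparse input serving as the reconstruction target
        return chamfer(upsampled, reference) + w_uniform * uniformity(upsampled)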
Parsing is All You Need for Accurate Gait Recognition in the Wild
Binary silhouettes and keypoint-based skeletons have dominated human gait
recognition studies for decades since they are easy to extract from video
frames. Despite their success in gait recognition for in-the-lab environments,
they usually fail in real-world scenarios due to their low information entropy
for gait representations. To achieve accurate gait recognition in the wild,
this paper presents a novel gait representation, named Gait Parsing Sequence
(GPS). GPSs are sequences of fine-grained human segmentation, i.e., human
parsing, extracted from video frames, so they have much higher information
entropy to encode the shapes and dynamics of fine-grained human parts during
walking. Moreover, to effectively explore the capability of the GPS
representation, we propose a novel human parsing-based gait recognition
framework, named ParsingGait. ParsingGait contains a Convolutional Neural
Network (CNN)-based backbone and two lightweight heads. The first head
extracts global semantic features from GPSs, while the other learns mutual
information of part-level features through Graph Convolutional Networks to
model the detailed dynamics of human walking. Furthermore, due to the lack of
suitable datasets, we build the first parsing-based dataset for gait
recognition in the wild, named Gait3D-Parsing, by extending the large-scale and
challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively
evaluate our method and existing gait recognition methods. The experimental
results show a significant improvement in accuracy brought by the GPS
representation and the superiority of ParsingGait. The code and dataset are
available at https://gait3d.github.io/gait3d-parsing-hp.
Comment: 16 pages, 14 figures, ACM MM 2023 accepted, project page: https://gait3d.github.io/gait3d-parsing-hp
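For intuition, here is a minimal PyTorch sketch of the two-head idea: one head pools a global descriptor from part-level features, while the other mixes part features through a single graph-convolution step. The part count, feature sizes, and learnable adjacency are illustrative assumptions, not ParsingGait's implementation:

    import torch
    import torch.nn as nn

    class TwoHeadGait(nn.Module):
        def __init__(self, c_feat=256, n_parts=11, c_out=128):
            super().__init__()
            # Head 1: global semantic descriptor pooled over all parts.
            self.global_head = nn.Linear(c_feat, c_out)
            # Head 2: one graph-convolution step over a learnable part graph.
            self.adj = nn.Parameter(torch.eye(n_parts))
            self.gcn = nn.Linear(c_feat, c_out)

        def forward(self, part_feats):
            # part_feats: (B, P, C) backbone features pooled per body part
            g = self.global_head(part_feats.mean(dim=1))  # (B, c_out)
            a = torch.softmax(self.adj, dim=-1)           # row-normalized adjacency
            p = torch.relu(self.gcn(a @ part_feats))      # (B, P, c_out)
            return g, p

Learning the adjacency lets the model discover which body parts co-inform each other during walking, rather than fixing a kinematic tree by hand.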
Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework
Action recognition from videos, i.e., classifying a video into one of the
pre-defined action types, has been a popular topic in the communities of
artificial intelligence, multimedia, and signal processing. However, existing
methods usually consider an input video as a whole and learn models, e.g.,
Convolutional Neural Networks (CNNs), with coarse video-level class labels.
These methods can only output an action class for the video, but cannot provide
fine-grained and explainable cues to answer why the video shows a specific
action. Therefore, researchers have started to focus on a new task, Part-level Action
Parsing (PAP), which aims to not only predict the video-level action but also
recognize the frame-level fine-grained actions or interactions of body parts
for each person in the video. To this end, we propose a coarse-to-fine
framework for this challenging task. In particular, our framework first
predicts the video-level class of the input video, then localizes the body
parts and predicts the part-level action. Moreover, to balance accuracy and
computation in part-level action parsing, we propose to recognize
part-level actions from segment-level features. Furthermore, to overcome the
ambiguity of body parts, we propose a pose-guided positional embedding method
to accurately localize body parts. Through comprehensive experiments on a
large-scale dataset, i.e., Kinetics-TPS, our framework achieves
state-of-the-art performance, outperforming existing methods with a 31.10% ROC
score.
Comment: Accepted by IEEE ISCAS 2022, 5 pages, 2 figures. arXiv admin note: text overlap with arXiv:2110.0336
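As a hedged illustration of the pose-guided positional embedding idea described above, the sketch below maps normalized 2D keypoints to an embedding that is added to per-part visual features, keeping part queries spatially grounded; the dimensions and MLP form are our assumptions, not the paper's implementation:

    import torch
    import torch.nn as nn

    class PoseGuidedEmbedding(nn.Module):
        def __init__(self, c_feat=256):
            super().__init__()
            # Small MLP lifting normalized (x, y) keypoints into feature space.
            self.pos_mlp = nn.Sequential(
                nn.Linear(2, c_feat), nn.ReLU(), nn.Linear(c_feat, c_feat))

        def forward(self, part_feats, keypoints):
            # part_feats: (B, P, C) per-part visual features
            # keypoints:  (B, P, 2) keypoint coordinates in [0, 1], one per part
            return part_feats + self.pos_mlp(keypoints)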